Time Series Analysis of Spotify Data

Authors

Frankie, Anna, Piper, Kate, Ayaan

Introduction

Companies are always looking for methods to analyze the popularity of their products, as well as what predicts that popularity over time. The music industry is no different: Spotify has developed a popularity index that ranks songs from 0 to 100 on how popular they are compared to other songs. The score has three components: total streams of a song, how recently a song has been played, and the frequency at which a song is played. This score is immensely useful to Spotify and to music industry professionals on a song-by-song basis, because having a high popularity score increases a song’s discoverability to listeners. However, modeling mean values of song popularity over time also has the potential to be useful for Spotify (or similar streaming platforms), because it could help companies understand the factors that play into users’ satisfaction with the platform and listening patterns over time, as well as predict future values of popularity that might inform business decisions.

We have obtained a Kaggle dataset containing information about audio features of more than 160,000 Spotify songs that were released between 1921 and 2020. While several previous analyses have looked at this data and even examined the relationships between features and popularity, these projects have ignored the implicit potential for temporal autocorrelation in the data. Because the total number of listens for a given song inevitably increases over time, and because frequency and recency of listens depend in large part on media attention and artist activity, the mean song popularity for songs released in one month is likely to be highly related to that of songs released the previous month.

Our goal for this project is to create a model of the mean popularity for songs released at given intervals over time, using song feature data as predictors and ARIMA errors to account for leftover temporal autocorrelation.

Data Preparation and Exploratory Data Analysis

The original dataset for this project contains nineteen variables, including song name, artist name, release date, and audio features such as acousticness, danceability, loudness, and energy. A full codebook of all variables and their definitions is located in the data folder of this repository (spotify_data_codebook.md). The original data contained audio features and identification information for 169,909 songs.

Data Preparation

First, we altered the structure of our data, creating a new dataset (spotify_mean_yearly) that contains each year from 1921 to 2020 and the mean values of the following feature variables: acousticness, danceability, duration, energy, explicit, instrumentalness, key, liveness, loudness, mode, popularity, speechiness, tempo, and valence. We also created a dataset at monthly granularity to establish a regular time interval on which to model. Notably, filtering the release date for only observations that contain monthly data meant we excluded about 30% of observations (50,382 of the original 169,909), but we have no strong reason to believe that this filtering will bias the data, given that the time series of yearly mean popularity for songs with and without monthly release data matched (see plot).
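The monthly aggregation described above can be sketched as follows. This is illustrative rather than our exact preprocessing code: it assumes the raw data frame is named `spotify`, that its `release_date` column is a character string, and that rows with monthly granularity carry at least a year and month (e.g. "1999-04" or "1999-04-23").

```r
library(dplyr)
library(lubridate)

# Keep only songs whose release date includes a month, then average
# each audio feature within each release month.
spotify_mean_monthly <- spotify %>%
  filter(nchar(release_date) >= 7) %>%
  mutate(month = ymd(paste0(substr(release_date, 1, 7), "-01"))) %>%
  group_by(month) %>%
  summarize(across(
    c(acousticness, danceability, energy, explicit, instrumentalness,
      liveness, loudness, popularity, speechiness, tempo, valence),
    ~ mean(.x, na.rm = TRUE),
    .names = "{.col}_month_mean"
  )) %>%
  arrange(month)
```

Averaging within each month yields one observation per month, which gives the regular time index that the ARIMA machinery later in this report requires.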

Bivariate EDA: Relationships between popularity and song features

To continue our EDA and check which of these covariates might have a relationship with our outcome variable, song popularity, we plotted each variable against the response at monthly granularity. We also fit a rudimentary linear model (blue) and loess curve (red) to each.

The bivariate EDA (grouped row-wise by behavior) revealed some trends between the predictors and mean monthly song popularity. Danceability, loudness, and energy all had relatively strong positive trends with mean song popularity. Instrumentalness, acousticness, and liveness all had negative trends (higher values of the predictor corresponding to lower song popularity, and vice versa). Speechiness, duration, and tempo all had values clustered in one region across various popularity levels, and explicitness had a moderate positive trend with mean song popularity. The loess curves show some erratic behavior at the very high and very low values of these predictors, likely because such extreme popularity values are uncommon.

spotify_decades <- spotify_mean_monthly %>%
  mutate(year = as.numeric(format(month,'%Y'))) %>%
  mutate(decade = year - year %% 10 )
spotify_decades %>%
  ggplot(aes(x=danceability_month_mean, y = popularity_month_mean)) + 
  geom_point() + 
  geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") + 
  geom_smooth(color = "red", alpha = 0.25, se = FALSE) + 
  facet_wrap(~decade)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

spotify_decades %>%
  ggplot(aes(x=loudness_month_mean, y = popularity_month_mean)) + 
  geom_point() + 
  geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") + 
  geom_smooth(color = "red", alpha = 0.75, se = FALSE) + 
  facet_wrap(~decade)

spotify_decades %>%
  ggplot(aes(x=energy_month_mean, y = popularity_month_mean)) + 
  geom_point() + 
  geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") + 
  geom_smooth(color = "red", alpha = 0.75, se = FALSE) + 
  facet_wrap(~decade)

spotify_decades %>%
  ggplot(aes(x=instrumentalness_month_mean, y = popularity_month_mean)) + 
  geom_point() + 
  geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") + 
  geom_smooth(color = "red", alpha = 0.75, se = FALSE) + 
  facet_wrap(~decade)

spotify_decades %>%
  ggplot(aes(x=acousticness_month_mean, y = popularity_month_mean)) + 
  geom_point() + 
  geom_smooth(color='blue', alpha = 0.75,se = FALSE, method = "lm") + 
  geom_smooth(color = "red", alpha = 0.75, se = FALSE) + 
  facet_wrap(~decade)

Checking cross correlation

#lag correlations of tempo on popularity 
spotify_mean_monthly %>%
  select(month, tempo_month_mean, popularity_month_mean) %>%
  mutate(
    "lag 0" = lag(tempo_month_mean,0),
    "lag 1" = lag(tempo_month_mean,1),
    "lag 2" = lag(tempo_month_mean,2),
    "lag 3" = lag(tempo_month_mean,3),
    "lag 4" = lag(tempo_month_mean,4),
    "lag 5" = lag(tempo_month_mean,5),
    "lag 6" = lag(tempo_month_mean,6),
    "lag 7" = lag(tempo_month_mean,7),
    "lag 8" = lag(tempo_month_mean,8),
    "lag 9" = lag(tempo_month_mean,9),
    "lag 10"= lag(tempo_month_mean,10),
    "lag 11"= lag(tempo_month_mean,11))  %>%
  select(-tempo_month_mean, -month) %>%
  tidyr::gather(lag, tempo_month_mean, -popularity_month_mean) %>%
  group_by(lag) %>%
  summarize(corr = cor(tempo_month_mean, popularity_month_mean, use="complete.obs"))
# A tibble: 12 × 2
   lag     corr
   <chr>  <dbl>
 1 lag 0  0.496
 2 lag 1  0.499
 3 lag 10 0.501
 4 lag 11 0.493
 5 lag 2  0.494
 6 lag 3  0.495
 7 lag 4  0.500
 8 lag 5  0.501
 9 lag 6  0.493
10 lag 7  0.498
11 lag 8  0.495
12 lag 9  0.498
#plot lag CCF
forecast::ggCcf(spotify_mean_monthly$tempo_month_mean, spotify_mean_monthly$popularity_month_mean)

Fitting preliminary ARIMA model to Popularity scores

As part of our EDA, we used a step-by-step process to create an ARIMA model of mean popularity scores over time from 1921 to 2020. Ultimately, we will use an ARIMA component to account for leftover temporal autocorrelation in the residuals of our linear model of mean song popularity based on mean song features; this initial ARIMA model may help inform that component by showing us the temporal patterns in song popularity over time.

# convert dataframe to time series
spotify_ts <- xts(spotify_mean_monthly$popularity_month_mean, spotify_mean_monthly$month)
spotify_ts <- as.ts(spotify_ts)

# forecast ARIMA
step1 <- forecast::ggtsdisplay(spotify_ts, points = FALSE)

Based on the time series visualization, we will begin by applying first-order differencing with an \(ARIMA(0,1,0)\) model. We will also store the root mean squared error value to track our progress while iterating through ARIMA models.

# first-order differencing
step2 <- forecast::Arima(
  spotify_ts, order = c(0, 1, 0)
)
forecast::ggtsdisplay(step2$residuals, points = FALSE, lag.max = 36)

# store rmse value
step2_rmse <- yardstick::rmse_vec(
  spotify_ts %>% unclass(),
  step2$fitted %>% unclass()
)

The time series has somewhat improved, but there remain some patterns that we can alleviate with a moving-average process. We will apply an MA(2) process on top of the first-order differencing, giving an \(ARIMA(0,1,2)\) model.

# first order differencing
# MA(2)
step3 <- forecast::Arima(
  spotify_ts, order = c(0, 1, 2)
)
forecast::ggtsdisplay(step3$residuals, points = FALSE, lag.max = 36)

# store rmse value
step3_rmse <- yardstick::rmse_vec(
  spotify_ts %>% unclass(),
  step3$fitted %>% unclass()
 )

The MA component somewhat improved the model, bringing the root mean squared error from 3.47 to 2.62. We can confirm our findings using the auto.arima function:

# auto arima on time series
forecast::auto.arima(spotify_ts)
Series: spotify_ts 
ARIMA(0,1,2) with drift 

Coefficients:
          ma1     ma2   drift
      -0.8890  0.0515  0.0659
s.e.   0.0315  0.0315  0.0130

sigma^2 = 6.753:  log likelihood = -2527.33
AIC=5062.67   AICc=5062.7   BIC=5082.55
# model fit
model = "Final Model - forecast::Arima(0,1,2)"
rmse = (spotify_ts - step3$fitted)^2 %>% mean() %>% sqrt() %>% round(3) %>% paste0("  [RMSE: ", ., "]")

step3 %>%
  {tibble(
    spotify_ts = spotify_ts %>% unclass(),
    model = .$fitted %>% unclass(),
    time  = time(.$fitted) %>% unclass()
  )} %>% 
  tidyr::gather(var, popularity_year_mean, -time) %>%
  mutate(var = forcats::as_factor(var)) %>%
  ggplot(aes(x=time, y=popularity_year_mean, color=var)) + 
    geom_line(alpha = 0.75, linewidth = 0.8) +
    labs(title = paste(model, rmse), x = "Time", y = "Average Popularity Score",
         color = "")

After applying first-order differencing and an MA(2) component, the model fit appears adequate: the ACF and PACF no longer show a pattern, and the fitted values track the observed data well. The final model is an \(ARIMA(0,1,2)\), with a final root mean squared error of 2.62.

# report ARIMA model
m = as_tsibble(spotify_ts) %>% 
  model(ARIMA(value ~ pdq(0, 1, 2)))
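From this fitted mable, a coefficient summary and a short-horizon forecast can be produced. This is a sketch rather than part of our reported results; the 12-month horizon is illustrative.

```r
library(fable)
library(fabletools)
library(tsibble)

# Coefficient estimates and information criteria for the ARIMA(0,1,2) fit
report(m)

# Forecast mean popularity 12 steps ahead, with prediction intervals
m %>%
  forecast(h = 12) %>%
  autoplot(as_tsibble(spotify_ts))
```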

Model fitting

Our goal in creating this model is to predict monthly mean song popularity based on monthly mean song features. From our EDA, we can see that danceability, energy, explicitness, instrumentalness, loudness, and tempo had linear associations with monthly mean song popularity. We will start by fitting a multiple linear regression model with mean monthly popularity as the outcome, the features identified in EDA as predictors, and no temporal component.

# fit model
model1 <- lm(popularity_month_mean ~ danceability_month_mean + energy_month_mean + explicit_month_mean +
               instrumentalness_month_mean + loudness_month_mean + tempo_month_mean, data = spotify_mean_monthly) 

# output model
summary(model1)

Call:
lm(formula = popularity_month_mean ~ danceability_month_mean + 
    energy_month_mean + explicit_month_mean + instrumentalness_month_mean + 
    loudness_month_mean + tempo_month_mean, data = spotify_mean_monthly)

Residuals:
    Min      1Q  Median      3Q     Max 
-88.192  -4.858   0.811   5.789  24.988 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -24.92776    4.95018  -5.036 5.59e-07 ***
danceability_month_mean       5.66734    4.88026   1.161 0.245791    
energy_month_mean            78.90610    3.54047  22.287  < 2e-16 ***
explicit_month_mean          38.47577    3.03301  12.686  < 2e-16 ***
instrumentalness_month_mean -24.56021    1.98068 -12.400  < 2e-16 ***
loudness_month_mean          -0.62190    0.17437  -3.567 0.000378 ***
tempo_month_mean              0.06526    0.03887   1.679 0.093484 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.34 on 1059 degrees of freedom
Multiple R-squared:  0.7456,    Adjusted R-squared:  0.7442 
F-statistic: 517.3 on 6 and 1059 DF,  p-value: < 2.2e-16
# store RMSE and AIC
model1_rmse <- yardstick::rmse_vec(
  spotify_mean_monthly$popularity_month_mean,
  model1$fitted
 )
model1_aic <- AIC(model1)

The model performed decently, with a root mean squared error of 10.309 and an AIC of 8015.144. We want to include some temporal component, so we will move forward by adding a cross-correlation lag term on the tempo predictor to our existing model.

# fit model
model2 = lm(popularity_month_mean ~ danceability_month_mean + energy_month_mean + explicit_month_mean + 
              instrumentalness_month_mean + loudness_month_mean + tempo_month_mean + 
              lag(tempo_month_mean, 10),
            data = spotify_mean_monthly)

# output model
summary(model2)

Call:
lm(formula = popularity_month_mean ~ danceability_month_mean + 
    energy_month_mean + explicit_month_mean + instrumentalness_month_mean + 
    loudness_month_mean + tempo_month_mean + lag(tempo_month_mean, 
    10), data = spotify_mean_monthly)

Residuals:
    Min      1Q  Median      3Q     Max 
-81.740  -4.630   0.888   5.828  22.769 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -54.08750    5.57833  -9.696  < 2e-16 ***
danceability_month_mean       4.93714    4.66043   1.059  0.28967    
energy_month_mean            66.87425    3.56841  18.741  < 2e-16 ***
explicit_month_mean          44.24741    2.97637  14.866  < 2e-16 ***
instrumentalness_month_mean -22.81283    1.93954 -11.762  < 2e-16 ***
loudness_month_mean          -0.48275    0.17034  -2.834  0.00468 ** 
tempo_month_mean              0.05469    0.03711   1.474  0.14079    
lag(tempo_month_mean, 10)     0.32335    0.03186  10.149  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.75 on 1048 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.7718,    Adjusted R-squared:  0.7702 
F-statistic: 506.2 on 7 and 1048 DF,  p-value: < 2.2e-16
# store RMSE and AIC
model2_rmse <- model2 %>% 
  broom::augment(newdata = spotify_mean_monthly) %>% 
  mutate(rmse = yardstick::rmse_vec(popularity_month_mean, .fitted)) %>% 
  select(rmse) %>% 
  first() %>% 
  pull()

model2_aic <- AIC(model2)

The new model performed better, with a root mean squared error of 9.713 and an AIC of 7816.361. The temporal component improved the model, so we will move forward by visualizing the residuals and considering a cross-correlation lag term on the response.

# visualize residuals
forecast::ggtsdisplay(model2$residuals, points=FALSE)

# visualize lag on residuals
forecast::ggCcf(spotify_mean_monthly$popularity_month_mean, model2$residuals)

# fit model
model3 = lm(popularity_month_mean ~ danceability_month_mean + energy_month_mean + explicit_month_mean + 
              instrumentalness_month_mean + loudness_month_mean + tempo_month_mean + 
              lag(tempo_month_mean, 10) + lag(popularity_month_mean, 10),
            data = spotify_mean_monthly)

# output model
summary(model3)

Call:
lm(formula = popularity_month_mean ~ danceability_month_mean + 
    energy_month_mean + explicit_month_mean + instrumentalness_month_mean + 
    loudness_month_mean + tempo_month_mean + lag(tempo_month_mean, 
    10) + lag(popularity_month_mean, 10), data = spotify_mean_monthly)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.3356  -1.8929  -0.1428   1.7722  18.1351 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    -0.72306    2.10328  -0.344   0.7311    
danceability_month_mean        -0.33879    1.67639  -0.202   0.8399    
energy_month_mean               2.49666    1.49398   1.671   0.0950 .  
explicit_month_mean             4.90253    1.16780   4.198 2.92e-05 ***
instrumentalness_month_mean    -3.61600    0.73364  -4.929 9.61e-07 ***
loudness_month_mean             0.07786    0.06159   1.264   0.2064    
tempo_month_mean                0.00259    0.01335   0.194   0.8463    
lag(tempo_month_mean, 10)       0.02911    0.01197   2.430   0.0152 *  
lag(popularity_month_mean, 10)  0.92436    0.01100  84.047  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.505 on 1047 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.9705,    Adjusted R-squared:  0.9703 
F-statistic:  4311 on 8 and 1047 DF,  p-value: < 2.2e-16
# store RMSE and AIC
model3_rmse <- model3 %>% 
  broom::augment(newdata = spotify_mean_monthly) %>% 
  mutate(rmse = yardstick::rmse_vec(popularity_month_mean, .fitted)) %>% 
  select(rmse) %>% 
  first() %>% 
  pull()

model3_aic <- AIC(model3)

The third model performed even better than the prior two, with a root mean squared error of 3.49 and an AIC of 5656.424. The temporal components improved the model, but we can instead include an ARIMA component on the errors in place of the cross-correlation terms we added in the second and third models.

# Create matrix of predictors
xreg <- cbind(energy_month_mean=spotify_mean_monthly$energy_month_mean, 
              explicit_month_mean=spotify_mean_monthly$explicit_month_mean,
              danceability_month_mean=spotify_mean_monthly$danceability_month_mean, 
              instrumentalness_month_mean=spotify_mean_monthly$instrumentalness_month_mean,
              loudness_month_mean=spotify_mean_monthly$loudness_month_mean, 
              tempo_month_mean=spotify_mean_monthly$tempo_month_mean
              )

# Drop the first column of the matrix (note: xreg has no intercept column,
# so this removes energy_month_mean) and isolate the response variable
xreg <- xreg[,-1]
popularity <- ts(spotify_mean_monthly$popularity_month_mean)

# fit model
model4 <- auto.arima(popularity, xreg=xreg)

# output model
summary(model4)
Series: popularity 
Regression with ARIMA(0,1,2) errors 

Coefficients:
          ma1     ma2   drift  explicit_month_mean  danceability_month_mean
      -0.8911  0.0523  0.0646               0.6546                   0.2525
s.e.   0.0314  0.0313  0.0128               0.9501                   1.2202
      instrumentalness_month_mean  loudness_month_mean  tempo_month_mean
                          -1.3288               0.0415           -0.0035
s.e.                       0.5643               0.0362            0.0094

sigma^2 = 6.721:  log likelihood = -2522.29
AIC=5062.58   AICc=5062.75   BIC=5107.32

Training set error measures:
                       ME     RMSE      MAE MPE MAPE      MASE         ACF1
Training set -0.001093683 2.581433 1.819034 NaN  Inf 0.7843184 0.0006724013
# store AIC
model4_aic <- AIC(model4)

# visualize predicted versus actual 
spotify_mean_monthly %>%
  mutate(fitted = model4$fitted) %>% 
  ggplot(aes(x = month, y = popularity_month_mean)) +
  geom_line() +
  geom_line(aes(y = fitted), color = "pink") + 
  labs(title = "Predicted vs. Observed, Model 4")

The ARIMA model included alongside the multiple linear regression performed the best in terms of AIC, yielding an AIC of 5062.581.
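The four candidates can be gathered into a single comparison table using the RMSE and AIC objects stored above. This is a sketch: `model4`'s training-set RMSE is pulled from `forecast::accuracy()`, since we did not store a separate `model4_rmse` object.

```r
library(dplyr)

model_comparison <- tibble(
  model = c("MLR (features only)",
            "MLR + lag-10 tempo",
            "MLR + lag-10 tempo and popularity",
            "Regression with ARIMA(0,1,2) errors"),
  rmse  = c(model1_rmse, model2_rmse, model3_rmse,
            forecast::accuracy(model4)[, "RMSE"]),
  aic   = c(model1_aic, model2_aic, model3_aic, model4_aic)
) %>%
  arrange(aic)

model_comparison
```

Sorting by AIC makes the ranking explicit and keeps the comparison reproducible as models are added or revised.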

Discussion

Discussion goes here…